1. Introduction

1.1. The regression problem 

In this paper, we study the linear regression problem: we observe $n$ pairs $(X_i, Y_i)$
with $Y_i = f(X_i) + \varepsilon_i$ for a noise $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$ to be specified later.

The idea is that the statistician is given (or chooses) a dictionary of functions
$(f_1, \dots, f_m)$, with possibly $m > n$, and wants to build a "good" estimate
of $f$ of the form $\alpha_1 f_1 + \dots + \alpha_m f_m$.

We have to specify two things: the distribution of the pairs
$(X_i, Y_i)$, and the criterion for a "good" estimate. We are going to
consider two cases.

1.2. Deterministic and random design 

1.2.1. Deterministic design case 

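In this case the values $X_1, \dots, X_n$ are deterministic, and the $\varepsilon_i$ are i.i.d.
according to some distribution $\mathbb{P}$ with $\mathbb{E}_{\varepsilon \sim \mathbb{P}}(\varepsilon) = 0$ and $\mathbb{E}_{\varepsilon \sim \mathbb{P}}(\varepsilon^2) < \infty$. In this
case, the distance between $f$ and $\alpha_1 f_1 + \dots + \alpha_m f_m$ will be measured in terms
of the so-called empirical norm.

Definition 1.1. For any $\alpha = (\alpha_1, \dots, \alpha_m) \in \mathbb{R}^m$ and $\alpha' = (\alpha'_1, \dots, \alpha'_m) \in \mathbb{R}^m$
we put

$$\|\alpha - \alpha'\|_n^2 = \frac{1}{n} \sum_{i=1}^n \Bigl[ \sum_{j=1}^m \alpha_j f_j(X_i) - \sum_{j=1}^m \alpha'_j f_j(X_i) \Bigr]^2$$

and

$$\overline{\alpha}_n \in \arg\min_{\alpha \in \mathbb{R}^m} \frac{1}{n} \sum_{i=1}^n \Bigl[ f(X_i) - \sum_{j=1}^m \alpha_j f_j(X_i) \Bigr]^2 .$$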
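As a minimal sketch of Definition 1.1 (not part of the original paper), the squared empirical norm can be computed from the matrix of dictionary evaluations. The names `F` (with `F[i][j]` standing for $f_j(X_i)$) and `empirical_sq_norm` are assumptions introduced only for this illustration.

```python
def empirical_sq_norm(alpha, alpha_prime, F):
    """Squared empirical norm ||alpha - alpha'||_n^2 of Definition 1.1.

    F is a hypothetical n x m list of lists with F[i][j] = f_j(X_i).
    """
    n = len(F)
    total = 0.0
    for row in F:
        # sum_j (alpha_j - alpha'_j) * f_j(X_i) for the i-th design point
        diff = sum((a - b) * f for a, b, f in zip(alpha, alpha_prime, row))
        total += diff * diff
    return total / n
```

For instance, with two design points and the dictionary evaluated as the identity matrix, the vector $(2, 0)$ has squared empirical norm $2^2 / 2 = 2$.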



1.2.2. Random design case 

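In this case, we assume that the pairs $(X_i, Y_i)$ are i.i.d. according to some
distribution $P$, that the marginal distribution of every $X_i$ is $P_X$, and that we
still have $\mathbb{E}_{(X,Y) \sim P}(\varepsilon) = 0$ and $\mathbb{E}_{(X,Y) \sim P}(\varepsilon^2) < \infty$. The distance will be measured
by the $L^2$ distance with respect to $P_X$.

Definition 1.2. For any $\alpha, \alpha' \in \mathbb{R}^m$ we put

$$\|\alpha - \alpha'\|_X^2 = \mathbb{E}_{X \sim P_X} \Bigl\{ \Bigl[ \sum_{j=1}^m \alpha_j f_j(X) - \sum_{j=1}^m \alpha'_j f_j(X) \Bigr]^2 \Bigr\}$$

and

$$\overline{\alpha}_X \in \arg\min_{\alpha \in \mathbb{R}^m} \mathbb{E}_{X \sim P_X} \Bigl\{ \Bigl[ f(X) - \sum_{j=1}^m \alpha_j f_j(X) \Bigr]^2 \Bigr\} .$$

Moreover, we make the following restrictive hypothesis: the statistician
knows $P_X$.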

1.3. General notations 



Now, we assume that we are in one of the two cases defined previously. However, 
as the results we want to state are the same in both settings, we introduce the 
following notation. 

Definition 1.3. We introduce the general norm

$$\|\alpha - \alpha'\|_{GN}$$

that is simply $\|\alpha - \alpha'\|_n$ if we are in the deterministic design case and
$\|\alpha - \alpha'\|_X$ if we are in the random design case. Moreover, we will let $\overline{\alpha}$ denote
$\overline{\alpha}_n$ or $\overline{\alpha}_X$ according to the case.

In any case, we let $P$ denote the distribution of the sample $(X_i, Y_i)_{i=1,\dots,n}$.

In order to simplify the notations, we assume that the functions $f_j$ of the
dictionary are normalized, in the sense that $\frac{1}{n} \sum_{i=1}^n f_j^2(X_i) = 1$ if we are in the
deterministic design case and that $\mathbb{E}_{X \sim P_X}[f_j(X)^2] = 1$ if we are in the random
design case. Note that this could be simply written in terms of the general norm:
if we let $e_1 = (1, 0, \dots, 0), \dots, e_m = (0, \dots, 0, 1)$ denote the canonical basis of $\mathbb{R}^m$, we
just have to assume that for any $j \in \{1, \dots, m\}$, $\|e_j\|_{GN} = 1$.

Finally, let us mention that $\langle \cdot, \cdot \rangle_{GN}$ will denote the scalar product associated
to the norm $\|\cdot\|_{GN}$ while we will use the notation $\|\cdot\|$ for the Euclidean norm in
$\mathbb{R}^m$ and $\langle \cdot, \cdot \rangle$ for the associated scalar product.

1.4. Previous works and organization of the paper

The aim of this paper is to propose a method to estimate the real regression
function (say $f$) on the basis of the dictionary $(f_1, \dots, f_m)$, that performs
well even if $m > n$.

Recently, many algorithms have been proposed for that purpose. Let us cite,
among others, the bridge regression by Frank and Friedman [14], and a particular
case of bridge regression called LASSO by Tibshirani [19], some variants or
generalizations like LARS by Efron, Hastie, Johnstone and Tibshirani [13], the
Dantzig selector by Candes and Tao [9] and the Group LASSO by Bakin [3],
Yuan and Lin [21] and Chesneau and Hebiri [11], or iterative algorithms like
Iterative Feature Selection in our paper [2] or greedy algorithms in Barron,
Cohen, Dahmen and DeVore [4]. This paper proposes a general method that
contains LASSO, Dantzig selector and Iterative Feature Selection as particular
cases.

Note that in the case where $m/n$ is small, we can use the ordinary least squares
estimate. The risk of this estimator is roughly of order $m/n$. But when $m/n > 1$, this
estimator is not even properly defined. The idea of all the mentioned works is
the following: if there is a "small" vector space $F \subset \mathbb{R}^m$ such that $\overline{\alpha} \in F$, one
could build a constrained estimator with a risk of order $\dim(F)/n$. But can we obtain
such a result if $F$ is unknown? For example, a lot of papers study the sparsity
of $\overline{\alpha}$: this means that $F$ is the span of a few $e_j$, or, in other words, that $\overline{\alpha}$ has
only a small number (say $p$) of non-zero coordinates. An estimator that automatically
selects $p$ relevant coordinates and achieves a risk close to $p/n$ is said to
satisfy a "sparsity oracle inequality". A paper by Bickel, Ritov and Tsybakov
[5] gives sparsity oracle inequalities for the LASSO and the Dantzig selector
in the case of the deterministic design. Another paper by Bunea, Tsybakov
and Wegkamp [8] gives sparsity oracle inequalities for the LASSO. That paper
is written in a more general context than ours: random design with unknown
distribution (in the case of a random design, remember that our method requires
the knowledge of the distribution of the design). However, its main results
require the assumption $\|f_j\|_\infty \le L$ for some given $L$, which is not necessary
in our paper and which prevents the use of popular bases of functions like wavelets.
This is due to the use of Hoeffding's inequality in the technical parts of that
paper.

Our paper uses a geometric point of view. This allows us to build a general
method of estimation and to obtain simple sparsity oracle inequalities for the
obtained estimator, in both the deterministic design case and the random design case with
known distribution. It uses a (Bernstein-type) deviation inequality proved in a
previous work [2] that is sharper than Hoeffding's inequality, and so gets rid of
the assumption of a (uniform) bound over the functions of the dictionary. Another
improvement is that our method is valid for some types of data-dependent
dictionaries of functions, for example the case where $m = n$ and

$$\{f_1(\cdot), \dots, f_m(\cdot)\} = \{K(X_1, \cdot), \dots, K(X_n, \cdot)\}$$

where $K$ is a function $\mathcal{X}^2 \to \mathbb{R}$, performing kernel estimation.

In Section 2, we give the general form of our algorithm under a particular
assumption, Assumption (CRA), which says that we are able to build some confidence
region for the best value of $\alpha$ in some subspace of $\mathbb{R}^m$.

In Section 3, we show why Iterative Feature Selection (IFS), LASSO and Dantzig
Selector, among others, are particular cases of our algorithm. We exhibit another
particular case of interest (called the Correlation Selector in this paper). Moreover,
we prove some oracle inequalities for the obtained estimators: roughly,
LASSO, Dantzig Selector and IFS perform well when the vector $\overline{\alpha}$ is sparse
(which means that a lot of its coordinates $\langle \overline{\alpha}, e_j \rangle = \overline{\alpha}_j$ are equal to zero) or
approximately sparse (a lot of coordinates are nearly equal to zero), while the
Correlation Selector performs well when a lot of the $\langle \overline{\alpha}, e_j \rangle_{GN}$ are almost equal to
zero (in the deterministic design case, $\langle \overline{\alpha}, e_j \rangle_{GN} = \mathbb{E}\bigl(\frac{1}{n} \sum_{i=1}^n f_j(X_i) Y_i\bigr)$ while in
the random design case, $\langle \overline{\alpha}, e_j \rangle_{GN} = \mathbb{E}(f_j(X) Y)$, so in any case, this quantity
is a measure of the correlation between the variable $Y$ and the $j$-th function in
the dictionary). So, intuitively, the Correlation Selector gives good results when
most of the functions in the dictionary have weak correlation with $Y$, but we
expect that altogether these functions can bring a good prediction for $Y$.

In order to prove oracle inequalities, some types of orthogonality (or approximate
orthogonality, in some sense) are required on the dictionary of functions.
Our results are the following: under orthogonality of the dictionary of functions,
and using only general properties of our family of estimators, we have a
sparsity oracle inequality. Under an approximate orthogonality condition taken
from Bickel, Ritov and Tsybakov [5], the result can be extended to the LASSO
and the Dantzig selector (with a proof taken from [5]). Some remarks by Huang,
Cheang and Barron [15] show that these results can be extended to IFS with
a slight modification of the estimator. Finally, the central result for the Correlation
Selector does not require any hypothesis on the dictionary of functions
but concerns a measure of the risk that is not natural; we obtain a result on the
risk measured by $\|\cdot\|_{GN}$ under an assumption very close to the one in [5]. Here
again, the proof uses only general properties of our family of estimators.

Section 4 is dedicated to simulations: we compare ordinary least squares
(OLS), LASSO, Iterative Feature Selection and the Correlation Selector on a toy
example. Simulations show that both particular cases of our family of estimators
(LASSO and Iterative Feature Selection) generally outperform the OLS
estimate. Moreover, LASSO generally performs better than Iterative Feature
Selection; however, this is not always true. This fact leads to the conclusion that
a data-driven choice of a particular algorithm in our general family could lead
to optimal results.

After a conclusion (Section 5), Section 6 is dedicated to some proofs. 



2. General projection algorithms 

2.1. Additional notations and hypothesis 

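Definition 2.1. Let $\mathcal{C}$ be a closed, convex subset of $\mathbb{R}^m$. We let $\Pi_{\mathcal{C}}^{GN}(\cdot)$ denote
the orthogonal projection on $\mathcal{C}$ with respect to the norm $\|\cdot\|_{GN}$:

$$\Pi_{\mathcal{C}}^{GN}(\alpha) = \arg\min_{\beta \in \mathcal{C}} \|\alpha - \beta\|_{GN} .$$

For a generic distance $\delta$, we will use the notation $\Pi_{\mathcal{C}}^{\delta}(\cdot)$ for the orthogonal
projection on $\mathcal{C}$ with respect to $\delta$.

We put, for every $j \in \{1, \dots, m\}$:

$$\mathcal{M}_j = \{\alpha \in \mathbb{R}^m, \ \ell \neq j \Rightarrow \alpha_\ell = 0\} = \{\alpha e_j, \ \alpha \in \mathbb{R}\} .$$

Definition 2.2. We put, for every $j \in \{1, \dots, m\}$:

$$\overline{\alpha}^j = \arg\min_{\alpha \in \mathcal{M}_j} \|\alpha - \overline{\alpha}\|_{GN} = \Pi_{\mathcal{M}_j}^{GN}(\overline{\alpha}) .$$

Moreover let us put:

$$\tilde{\alpha}_j = \frac{1}{n} \sum_{i=1}^n f_j(X_i) Y_i \quad \text{and} \quad \hat{\alpha}^j = \tilde{\alpha}_j e_j .$$

Remark 2.1. In the deterministic design case ($\|\cdot\|_{GN} = \|\cdot\|_n$) we have

$$\overline{\alpha}^j = \Bigl[ \frac{1}{n} \sum_{i=1}^n f_j(X_i) f(X_i) \Bigr] e_j$$

and in the random design case we have

$$\overline{\alpha}^j = \mathbb{E}_{X \sim P_X} \bigl[ f_j(X) f(X) \bigr] e_j$$

so in any case, $\hat{\alpha}^j$ is an estimator of $\overline{\alpha}^j$.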
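The quantities of Definition 2.2 are simple empirical averages. The following sketch (an illustration only; the names `F`, `Y`, `tilde_alpha` and `is_normalized` are assumptions, with `F[i][j]` standing for $f_j(X_i)$) computes the coefficients $\tilde{\alpha}_j$ and checks the normalization $\frac{1}{n} \sum_i f_j^2(X_i) = 1$ assumed in the deterministic design case.

```python
def tilde_alpha(F, Y):
    """Return [tilde_alpha_1, ..., tilde_alpha_m] with
    tilde_alpha_j = (1/n) * sum_i f_j(X_i) * Y_i."""
    n, m = len(F), len(F[0])
    return [sum(F[i][j] * Y[i] for i in range(n)) / n for j in range(m)]


def is_normalized(F, tol=1e-9):
    """Check that (1/n) * sum_i f_j(X_i)^2 = 1 for every column j."""
    n, m = len(F), len(F[0])
    return all(
        abs(sum(F[i][j] ** 2 for i in range(n)) / n - 1.0) <= tol
        for j in range(m)
    )
```

With `F = [[1, 1], [1, -1]]` and `Y = [3, 1]`, the dictionary is normalized and the sketch gives $\tilde{\alpha} = (2, 1)$.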

Hypothesis (CRA). We say that the confidence region assumption (CRA) is
satisfied if for $\varepsilon \in [0, 1]$ we have a bound $r(j, \varepsilon) \in \mathbb{R}$ such that

$$P \Bigl[ \forall j \in \{1, \dots, m\}, \ \|\overline{\alpha}^j - \hat{\alpha}^j\|_{GN}^2 \le r(j, \varepsilon) \Bigr] \ge 1 - \varepsilon .$$

In our previous work [2] we examined different hypotheses on the probability
$P$ under which this assumption is satisfied. For example, using inequalities by Catoni
[10] and Panchenko [17] we proved the following results.

Lemma 2.1. Let us assume that $\|f\|_\infty \le L$ for some known $L$. Let us assume
that $\mathbb{E}_P(\varepsilon^2) \le \sigma^2$ for some known $\sigma^2 < \infty$. Then Assumption (CRA) is



satisfied, with 



4(1 + log 



1 



Remark 2.2. It is also shown in [2] that we are allowed to take

$$\{f_1(\cdot), \dots, f_m(\cdot)\} = \{K(X_1, \cdot), \dots, K(X_n, \cdot)\}$$

for some function $K : \mathcal{X}^2 \to \mathbb{R}$ (this allows for an estimator of $f(x)$ of the kernel
form $\sum_{i=1}^n \alpha_i K(X_i, x)$), even in the random design case, but we have to take



4(1 + log if) 



m — * J 



i = l 



in this case. 



Lemma 2.2. Let us assume that there is a $K > 0$ such that $P(|Y| \le K) = 1$.
Then Assumption (CRA) is satisfied with



8K 2 (1 + log - ; 



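Definition 2.3. When (CRA) is satisfied, we define, for any $\varepsilon > 0$ and $j \in \{1, \dots, m\}$,
the random set

$$\mathcal{CR}(j, \varepsilon) = \Bigl\{ \alpha \in \mathbb{R}^m : \ \|\Pi_{\mathcal{M}_j}^{GN}(\alpha) - \hat{\alpha}^j\|_{GN}^2 \le r(j, \varepsilon) \Bigr\} .$$

This can easily be interpreted: Assumption (CRA) says that there is a confidence
region for $\overline{\alpha}^j$ in the small model $\mathcal{M}_j$; $\mathcal{CR}(j, \varepsilon)$ is the set of all vectors
falling in this confidence region when they are orthogonally projected on $\mathcal{M}_j$.

We remark that the hypothesis implies that

$$P \Bigl[ \forall j \in \{1, \dots, m\}, \ \overline{\alpha} \in \mathcal{CR}(j, \varepsilon) \Bigr] \ge 1 - \varepsilon .$$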
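Since the dictionary is normalized, the projection of a vector on $\mathcal{M}_j$ is $\langle \alpha, e_j \rangle_{GN} e_j$, so that membership in the confidence region reduces to $|\langle \alpha, e_j \rangle_{GN} - \tilde{\alpha}_j| \le \sqrt{r(j, \varepsilon)}$. The sketch below checks this condition in the deterministic design case. It is an illustration under stated assumptions: `F[i][j]` stands for $f_j(X_i)$, `r_j` for a value of $r(j, \varepsilon)$ supplied by the user, and the function names are hypothetical.

```python
import math


def gn_inner_with_ej(F, alpha, j):
    """<alpha, e_j>_n = (1/n) sum_i [sum_l alpha_l f_l(X_i)] * f_j(X_i)."""
    n = len(F)
    return sum(sum(a * f for a, f in zip(alpha, row)) * row[j] for row in F) / n


def in_confidence_region(F, Y, alpha, j, r_j):
    """True if alpha lies in CR(j, eps), assuming (1/n) sum_i f_j(X_i)^2 = 1."""
    n = len(F)
    tilde_alpha_j = sum(F[i][j] * Y[i] for i in range(n)) / n
    return abs(gn_inner_with_ej(F, alpha, j) - tilde_alpha_j) <= math.sqrt(r_j)
```

Each region is a slab between two parallel hyperplanes, which is why every intersection of such regions is closed and convex.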



2.2. General description of the algorithm 

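We propose the following iterative algorithm. Let us choose a confidence level
$\varepsilon > 0$ and a distance on $\mathbb{R}^m$, say $\delta(\cdot, \cdot)$.

• Step 0. Choose $\hat{\alpha}(0) = (0, \dots, 0) \in \mathbb{R}^m$. Choose $\varepsilon \in [0, 1]$.

• General Step ($k$). Choose $N(k) \le m$ and indices $(j_1^{(k)}, \dots, j_{N(k)}^{(k)}) \in \{1, \dots, m\}^{N(k)}$
and put:

$$\hat{\alpha}(k) \in \arg\min_{\alpha \in \bigcap_{\ell=1}^{N(k)} \mathcal{CR}(j_\ell^{(k)}, \varepsilon)} \delta(\alpha, \hat{\alpha}(k-1)) .$$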
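
This algorithm is motivated by the following result.

Theorem 2.3. When the CRA assumption is satisfied we have:

$$P \Bigl[ \forall k \in \mathbb{N}, \ \delta(\hat{\alpha}(k), \overline{\alpha}) \le \delta(\hat{\alpha}(k-1), \overline{\alpha}) \le \dots \le \delta(\hat{\alpha}(0), \overline{\alpha}) \Bigr] \ge 1 - \varepsilon . \quad (2.1)$$

So, our algorithm builds a sequence of $\hat{\alpha}(k)$ that gets closer to $\overline{\alpha}$ (according to
$\delta$) at every step. Moreover, if $\delta(x, x') = \|x - x'\|_{GN}$ then

$$\hat{\alpha}(k) = \Pi^{GN}_{\bigcap_{\ell=1}^{N(k)} \mathcal{CR}(j_\ell^{(k)}, \varepsilon)} \bigl( \hat{\alpha}(k-1) \bigr)$$

and we have the following:

$$P \Bigl[ \forall k \in \mathbb{N}, \ \|\hat{\alpha}(k) - \overline{\alpha}\|_{GN}^2 \le \|\hat{\alpha}(0) - \overline{\alpha}\|_{GN}^2 - \sum_{j=1}^k \|\hat{\alpha}(j) - \hat{\alpha}(j-1)\|_{GN}^2 \Bigr] \ge 1 - \varepsilon .$$

Proof. Let us assume that

$$\forall j \in \{1, \dots, m\}, \quad \|\overline{\alpha}^j - \hat{\alpha}^j\|_{GN}^2 \le r(j, \varepsilon) .$$

This is true with probability at least $1 - \varepsilon$ according to assumption (CRA). In
this case we have seen that

$$\overline{\alpha} \in \bigcap_{\ell=1}^{N(k)} \mathcal{CR}(j_\ell^{(k)}, \varepsilon)$$

that is a closed convex region, and so, by definition, $\delta(\hat{\alpha}(k), \overline{\alpha}) \le \delta(\hat{\alpha}(k-1), \overline{\alpha})$
for any $k \in \mathbb{N}$. If $\delta$ is the distance associated with the norm $\|\cdot\|_{GN}$, let us choose
$k \in \mathbb{N}$ and write $\Pi_k$ for the projection $\Pi^{GN}$ on $\bigcap_{\ell=1}^{N(k)} \mathcal{CR}(j_\ell^{(k)}, \varepsilon)$. Then

$$\begin{aligned}
\|\hat{\alpha}(k) - \overline{\alpha}\|_{GN}^2
&= \|\Pi_k(\hat{\alpha}(k-1)) - \overline{\alpha}\|_{GN}^2 \\
&\le \|\hat{\alpha}(k-1) - \overline{\alpha}\|_{GN}^2 - \|\Pi_k(\hat{\alpha}(k-1)) - \hat{\alpha}(k-1)\|_{GN}^2 \\
&= \|\hat{\alpha}(k-1) - \overline{\alpha}\|_{GN}^2 - \|\hat{\alpha}(k) - \hat{\alpha}(k-1)\|_{GN}^2 .
\end{aligned}$$

An induction ends the proof. □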



Remark 2.3. We choose our estimator $\hat{\alpha} = \hat{\alpha}(k)$ for some step $k \in \mathbb{N}$; the choice
of the stopping step $k$ will depend on the particular choices of the projections
and is detailed in what follows. But note that there is no bias-variance
balance involved in the choice of $k$, as Theorem 2.3 shows that overfitting
is not possible for large values of $k$.

3. Particular cases and oracle inequalities

We study some particular cases depending on the choice of the distance $\delta(\cdot, \cdot)$
and on the sets we project on.

Roughly, LASSO and Iterative Feature Selection (at least as introduced in
[2]) correspond to the choice $\delta(\alpha, \alpha') = \|\alpha - \alpha'\|_{GN}$, and are studied first.

The Dantzig selector corresponds to the choice $\delta(\alpha, \alpha') = \|\alpha - \alpha'\|_1$, the $\ell_1$ distance;
it is studied next.

Finally, the new Correlation Selector corresponds to another choice for $\delta$.



3.1. The LASSO 



Here, we use only one step where we project onto the intersection of all the
confidence regions and so we obtain:

$$\hat{\alpha}^L = \hat{\alpha}(1) = \Pi^{GN}_{\bigcap_{\ell=1}^m \mathcal{CR}(\ell, \varepsilon)} (0) .$$

The optimization program to obtain $\hat{\alpha}^L$ is given by:

$$\begin{cases} \arg\min_{\alpha = (\alpha_1, \dots, \alpha_m) \in \mathbb{R}^m} \|\alpha\|_{GN}^2 \\ \text{s.t. } \alpha \in \bigcap_{\ell=1}^m \mathcal{CR}(\ell, \varepsilon) \end{cases}$$

and so:

$$\begin{cases} \arg\min_{\alpha \in \mathbb{R}^m} \|\alpha\|_{GN}^2 \\ \text{s.t. } \forall j \in \{1, \dots, m\}, \ |\langle \alpha, e_j \rangle_{GN} - \tilde{\alpha}_j| \le \sqrt{r(j, \varepsilon)} \end{cases} \quad (3.1)$$

Proposition 3.1. Every solution of the program

$$\arg\min_{\alpha \in \mathbb{R}^m} \Bigl\{ \|\alpha\|_{GN}^2 - 2 \sum_{j=1}^m \alpha_j \tilde{\alpha}_j + 2 \sum_{j=1}^m \sqrt{r(j, \varepsilon)} \, |\alpha_j| \Bigr\} \quad (3.2)$$

satisfies Program 3.1. Moreover, all the solutions $\hat{\alpha}$ of Program 3.1 have the same
risk value $\|\hat{\alpha} - \overline{\alpha}\|_{GN}$. Finally, in the deterministic design case, Program
3.2 is equivalent to:

$$\arg\min_{\alpha \in \mathbb{R}^m} \Bigl\{ \frac{1}{n} \sum_{i=1}^n \Bigl[ Y_i - \sum_{j=1}^m \alpha_j f_j(X_i) \Bigr]^2 + 2 \sum_{j=1}^m \sqrt{r(j, \varepsilon)} \, |\alpha_j| \Bigr\} . \quad (3.3)$$

The proof is given at the end of the paper (in Subsection 6.1 page 1147).

Note that, if $r(j, \varepsilon)$ does not depend on $j$, Program 3.3 is exactly one of the
formulations of the LASSO estimator studied first by Tibshirani [19]. In the
particular deterministic design case, this dual representation was already
known and introduced by Osborne, Presnell and Turlach [16].

However, in the cases where $r(j, \varepsilon)$ is not constant, the difference with the
LASSO algorithm is the following: the harder the coordinates are to estimate,
the more penalized they are.

Moreover, note that Program 3.2 gives a different form of the usual LASSO
program for the cases where we do not use the empirical norm.
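To make Program 3.3 concrete, here is a minimal sketch that minimizes it by cyclic coordinate descent. Coordinate descent is a standard technique swapped in for illustration; it is not the method used in the paper, whose simulations rely on the LARS package. The sketch assumes the deterministic design case and a normalized dictionary, so that each coordinate update is a soft-thresholding of the partial residual correlation. The names `F` (with `F[i][j]` standing for $f_j(X_i)$), `r` (values of $r(j, \varepsilon)$) and `lasso_coordinate_descent` are hypothetical.

```python
import math


def soft(x, threshold):
    """Soft-thresholding: sgn(x) * max(|x| - threshold, 0)."""
    excess = abs(x) - threshold
    if excess <= 0.0:
        return 0.0
    return excess if x > 0 else -excess


def lasso_coordinate_descent(F, Y, r, sweeps=200):
    """Approximate minimizer of
    (1/n) sum_i (Y_i - sum_j a_j F[i][j])^2 + 2 sum_j sqrt(r_j) |a_j|,
    assuming (1/n) sum_i F[i][j]^2 = 1 for every j."""
    n, m = len(F), len(F[0])
    alpha = [0.0] * m
    fitted = [0.0] * n  # current values of sum_j alpha_j f_j(X_i)
    for _ in range(sweeps):
        for j in range(m):
            # correlation of f_j with the residual that excludes coordinate j
            rho_j = sum(
                F[i][j] * (Y[i] - fitted[i] + alpha[j] * F[i][j])
                for i in range(n)
            ) / n
            new_value = soft(rho_j, math.sqrt(r[j]))
            delta = new_value - alpha[j]
            if delta != 0.0:
                for i in range(n):
                    fitted[i] += delta * F[i][j]
                alpha[j] = new_value
    return alpha
```

On the small correlated design `F = [[1, 1], [1, 1], [1, 1], [1, -1]]` with `Y = [4, 0, 0, 0]` and $r(j, \varepsilon) = 0.25$, we have $\tilde{\alpha} = (1, 1)$ and the iterates converge to $(1/3, 1/3)$, a point where both constraints of Program 3.1 are active.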

3.2. Iterative Feature Selection (IFS)

As in the LASSO case we use the distance $\delta(\alpha, \beta) = \|\alpha - \beta\|_{GN}$.

The only difference is that instead of taking the intersection of every confidence
region, we project on each of them iteratively. So the algorithm is the
following:

$$\hat{\alpha}(0) = (0, \dots, 0)$$

and at each step $k$ we choose a $j(k) \in \{1, \dots, m\}$ and

$$\hat{\alpha}(k) = \Pi^{GN}_{\mathcal{CR}(j(k), \varepsilon)} \bigl( \hat{\alpha}(k-1) \bigr) .$$

We choose a stopping step $\hat{k}$ and put

$$\hat{\alpha}^{IFS} = \hat{\alpha}(\hat{k}) .$$

This is exactly the Iterative Feature Selection algorithm that was introduced
in Alquier [2], with the choice of $j(k)$:

$$j(k) = \arg\max_{j} \bigl\| \hat{\alpha}(k-1) - \Pi^{GN}_{\mathcal{CR}(j, \varepsilon)} \bigl( \hat{\alpha}(k-1) \bigr) \bigr\|_{GN}$$

and the suggestion to take as a stopping step

$$\hat{k} = \inf \bigl\{ k \in \mathbb{N}^*, \ \|\hat{\alpha}(k) - \hat{\alpha}(k-1)\|_{GN} \le \kappa \bigr\}$$

for some small $\kappa > 0$.

Remark 3.1. In Alquier [2], it is proved that:

$$\hat{\alpha}(k) = \hat{\alpha}(k-1) + \mathrm{sgn}(\beta_k) \bigl( |\beta_k| - \sqrt{r(j(k), \varepsilon)} \bigr)_+ e_{j(k)} \quad (3.4)$$

where

$$\beta_k = \frac{1}{n} \sum_{i=1}^n f_{j(k)}(X_i) \Bigl[ Y_i - \sum_{\ell=1}^m \hat{\alpha}_\ell(k-1) f_\ell(X_i) \Bigr] .$$

So this algorithm looks quite similar to a greedy algorithm, as described
by Barron, Cohen, Dahmen and DeVore [4]. Actually, it would be a greedy
algorithm if we replaced $r(j, \varepsilon)$ by $0$ (such a choice is however not possible here):
it is a soft-thresholded version of a greedy algorithm. Such greedy algorithms
were studied in a recent paper by Huang, Cheang and Barron [15] under the
name "penalized greedy algorithm", in the case $\|\cdot\|_{GN} = \|\cdot\|_n$.

Note that in Iterative Feature Selection, every selected feature actually improves
the estimator: $\|\hat{\alpha}(k) - \overline{\alpha}\|_{GN} \le \|\hat{\alpha}(k-1) - \overline{\alpha}\|_{GN}$ (Equation 2.1).
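The iteration of this subsection can be sketched as follows in the deterministic design case. This is an illustration under stated assumptions, not the author's implementation: the dictionary is assumed normalized (so that a move of size $t$ along $e_j$ has norm $|t|$), `F[i][j]` stands for $f_j(X_i)$, `r[j]` for $r(j, \varepsilon)$, and the function names are hypothetical. Each step projects on the confidence region that produces the largest move, using the soft-thresholded update of Remark 3.1, and stops when that move is at most `kappa`.

```python
import math


def gn_inner_with_ej(F, alpha, j):
    """<alpha, e_j>_n = (1/n) sum_i [sum_l alpha_l f_l(X_i)] * f_j(X_i)."""
    n = len(F)
    return sum(sum(a * f for a, f in zip(alpha, row)) * row[j] for row in F) / n


def iterative_feature_selection(F, Y, r, kappa=1e-10, max_steps=1000):
    """Greedy sequence of projections on the slabs CR(j, eps)."""
    n, m = len(F), len(F[0])
    tilde = [sum(F[i][j] * Y[i] for i in range(n)) / n for j in range(m)]
    alpha = [0.0] * m
    for _ in range(max_steps):
        best_j, best_move = None, 0.0
        for j in range(m):
            # beta = tilde_alpha_j - <alpha, e_j>_n: correlation of f_j with the residual
            beta = tilde[j] - gn_inner_with_ej(F, alpha, j)
            # signed length of the projection step on CR(j, eps)
            move = math.copysign(max(abs(beta) - math.sqrt(r[j]), 0.0), beta)
            if abs(move) > abs(best_move):
                best_j, best_move = j, move
        if best_j is None or abs(best_move) <= kappa:
            break  # alpha already lies (up to kappa) in every confidence region
        alpha[best_j] += best_move
    return alpha
```

On the design `F = [[1, 1], [1, 1], [1, 1], [1, -1]]` with `Y = [4, 0, 0, 0]` and $r(j, \varepsilon) = 0.25$, the sketch stops after two moves at $(0.5, 0.25)$, which satisfies both constraints $|\langle \alpha, e_j \rangle_n - \tilde{\alpha}_j| \le 0.5$.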

3.3. The Dantzig selector 

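The Dantzig selector is based on a change of the distance $\delta$. We choose

$$\delta(\alpha, \alpha') = \|\alpha - \alpha'\|_1 = \sum_{j=1}^m |\alpha_j - \alpha'_j| .$$

As in the LASSO case, we make only one projection onto the intersection of
every confidence region:

$$\hat{\alpha}^{DS} \in \arg\min_{\alpha \in \bigcap_{\ell=1}^m \mathcal{CR}(\ell, \varepsilon)} \|\alpha\|_1$$

and so $\hat{\alpha}^{DS}$ is a solution of the program:

$$\begin{cases} \arg\min_{\alpha = (\alpha_1, \dots, \alpha_m) \in \mathbb{R}^m} \sum_{j=1}^m |\alpha_j| \\ \text{s.t. } \forall j \in \{1, \dots, m\}, \ |\langle \alpha, e_j \rangle_{GN} - \tilde{\alpha}_j| \le \sqrt{r(j, \varepsilon)} . \end{cases}$$

In the case where $r(j, \varepsilon)$ does not depend on $j$, and where $\|\cdot\|_{GN} = \|\cdot\|_n$, this
program is exactly the one proposed by Candes and Tao [9] when they introduced
the Dantzig selector.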

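3.4. Oracle Inequalities for the LASSO, the Dantzig Selector and
IFS

Definition 3.1. For any $S \subset \{1, \dots, m\}$ let us put

$$\mathcal{M}_S = \{\alpha \in \mathbb{R}^m, \ j \notin S \Rightarrow \alpha_j = 0\}$$

and

$$\overline{\alpha}_S = \arg\min_{\alpha \in \mathcal{M}_S} \|\alpha - \overline{\alpha}\|_{GN} .$$

Every $\mathcal{M}_S$ is a submodel of $\mathbb{R}^m$ of dimension $|S|$ and $\overline{\alpha}_S$ is the best approximation
of $\overline{\alpha}$ in this submodel.

Theorem 3.2. Let us assume that assumption (CRA) is satisfied. Let us assume
that the functions $f_1, \dots, f_m$ are orthogonal with respect to $\langle \cdot, \cdot \rangle_{GN}$. In this
case the order of the projections in Iterative Feature Selection does not affect
the obtained estimator, so we can set

$$\hat{\alpha}^{IFS} = \Pi^{GN}_{\mathcal{CR}(m, \varepsilon)} \cdots \Pi^{GN}_{\mathcal{CR}(1, \varepsilon)} 0 .$$

Then

$$\hat{\alpha}^{IFS} = \hat{\alpha}^L = \hat{\alpha}^{DS} = \sum_{j=1}^m \mathrm{sgn}(\tilde{\alpha}_j) \bigl( |\tilde{\alpha}_j| - \sqrt{r(j, \varepsilon)} \bigr)_+ e_j$$

is a soft-thresholded estimator, and

$$P \Bigl[ \|\hat{\alpha}^L - \overline{\alpha}\|_{GN}^2 \le \inf_{S \subset \{1, \dots, m\}} \Bigl( \|\overline{\alpha}_S - \overline{\alpha}\|_{GN}^2 + 4 \sum_{j \in S} r(j, \varepsilon) \Bigr) \Bigr] \ge 1 - \varepsilon .$$

For the proof, see Subsection 6.2 page 1149.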
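In the orthogonal case of Theorem 3.2 the three estimators share a closed form, which the following sketch evaluates coordinate by coordinate. The function name and the argument names (`tilde_alpha` for the coefficients $\tilde{\alpha}_j$, `r` for the values $r(j, \varepsilon)$) are assumptions made for the illustration.

```python
import math


def soft_threshold_estimator(tilde_alpha, r):
    """Coordinates sgn(a_j) * (|a_j| - sqrt(r_j))_+ of the common estimator."""
    estimate = []
    for a, r_j in zip(tilde_alpha, r):
        excess = abs(a) - math.sqrt(r_j)
        if excess <= 0.0:
            estimate.append(0.0)  # coefficient inside the noise level: set to zero
        else:
            estimate.append(excess if a > 0 else -excess)
    return estimate
```

For example, with $\tilde{\alpha} = (2, -0.3, -1)$ and $r(j, \varepsilon) = 0.25$ for all $j$, the threshold is $0.5$ and the estimator is $(1.5, 0, -0.5)$: small coefficients are removed and the others are shrunk towards zero.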
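
Remark 3.2. We call "general regularity assumption with order $\beta > 0$ and
constant $C > 0$":

$$\forall j \in \{1, \dots, m\}, \quad \inf_{S \subset \{1, \dots, m\}, \ |S| \le j} \|\overline{\alpha}_S - \overline{\alpha}\|_{GN} \le C j^{-\beta} .$$

This is the kind of regularity satisfied by functions in weak Besov spaces, see
Cohen [12] and the references therein, with $f_j$ being wavelets. If the general
regularity assumption is satisfied with regularity $\beta > 0$ and constant $C > 0$ and
if there is a $k > 0$ such that

$$r(j, \varepsilon) \le \frac{k \log \frac{m}{\varepsilon}}{n},$$

then

$$P \Bigl[ \|\hat{\alpha}^L - \overline{\alpha}\|_{GN}^2 \le (2\beta + 1) \, C^{\frac{2}{2\beta+1}} \Bigl( \frac{2 k \log \frac{m}{\varepsilon}}{\beta n} \Bigr)^{\frac{2\beta}{2\beta+1}} + \frac{4 k \log \frac{m}{\varepsilon}}{n} \Bigr] \ge 1 - \varepsilon .$$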

Now, note that the orthogonality assumption is very restrictive. Usual results
about LASSO or Dantzig Selector involve only approximate orthogonality,
see for example Candes and Tao [9], Bunea [6], Bickel, Ritov and Tsybakov [5]
and Bunea, Tsybakov and Wegkamp [8], and sparsity (the fact that a lot of the
coordinates of $\overline{\alpha}$ are null), as for example the following result, which is a small
variant of a result in [8], and is recalled here in order to provide a comparison
with the results coming later in the paper.

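Theorem 3.3 (Variant of Bunea, Tsybakov and Wegkamp [8]). Let us assume
that we are in the deterministic design case, that assumption (CRA) is
satisfied, and that $r(j, \varepsilon) = r(\varepsilon)$ does not depend on $j$ (this is always possible by
taking $r(\varepsilon) = \sup_{j \in \{1, \dots, m\}} r(j, \varepsilon)$). Moreover, we assume that there is a constant
$D$ such that, for any $\alpha \in \Gamma_m$,

$$\|\alpha\|_{GN} \ge D \|\alpha\|$$

where

$$\Gamma_m = \Bigl\{ \alpha \in \mathbb{R}^m : \sum_{j : \overline{\alpha}_j = 0} |\alpha_j| \le 3 \sum_{j : \overline{\alpha}_j \neq 0} |\alpha_j| \Bigr\} .$$

Then

$$P \Bigl[ \|\hat{\alpha}^L - \overline{\alpha}\|_{GN}^2 \le \frac{16}{D^2} \, \bigl| \{ j : \overline{\alpha}_j \neq 0 \} \bigr| \, r(\varepsilon) \Bigr] \ge 1 - \varepsilon .$$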



The only difference with the original result in [8] is that $r(\varepsilon)$ is given in a
general form here, so we are allowed to use different values for $r(\varepsilon)$ depending
on the context, see the discussion of Hypothesis (CRA) in the beginning of the
paper. Similar results are available for the Dantzig selector, see Candes and Tao
[9], and Bickel, Ritov and Tsybakov [5].

Remark 3.3. We can wonder how IFS performs when the dictionary is not
orthogonal. Actually, the study of penalized greedy algorithms in Huang, Cheang
and Barron [15] leads to the following conclusion in the deterministic design
case: there are cases where IFS can be really worse than LASSO. However,
the authors propose a modification of the algorithm, called "relaxed penalized
greedy algorithm"; if we apply this modification here we obtain

$$\hat{\alpha}(k) = \gamma_k \hat{\alpha}(k-1) + \mathrm{sgn}(\tilde{\beta}_k) \bigl( |\tilde{\beta}_k| - \sqrt{r(j(k), \varepsilon)} \bigr)_+ e_{j(k)}$$

instead of Equation 3.4, where

1 " 



n 
»=i 



and at each step we have to minimize the empirical least square error with
respect to $\gamma_k \in [0, 1]$. Such a modification ensures that the estimators given by
the $k$-th step of the algorithm become equivalent to the LASSO when $k$ grows;
for more details see [15] (note that the interpretation in terms of confidence
regions and the property $\|\hat{\alpha}(k) - \overline{\alpha}\|_{GN} \le \|\hat{\alpha}(k-1) - \overline{\alpha}\|_{GN}$ are lost with this
modification).



3.5. A new estimator: the Correlation Selector 

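The idea of the Correlation Selector is to use $\delta(\alpha, \alpha') = \|\alpha - \alpha'\|_{CS}$ with

$$\|\alpha - \alpha'\|_{CS}^2 = \sum_{j=1}^m \langle \alpha - \alpha', e_j \rangle_{GN}^2 .$$

We make only one projection onto the intersection of every confidence region:

$$\hat{\alpha}^{CS} \in \arg\min_{\alpha \in \bigcap_{\ell=1}^m \mathcal{CR}(\ell, \varepsilon)} \|\alpha\|_{CS}$$

and so $\hat{\alpha}^{CS}$ is a solution of the program:

$$\begin{cases} \arg\min_{\alpha = (\alpha_1, \dots, \alpha_m) \in \mathbb{R}^m} \sum_{j=1}^m \langle e_j, \alpha \rangle_{GN}^2 \\ \text{s.t. } \forall j \in \{1, \dots, m\}, \ |\langle \alpha, e_j \rangle_{GN} - \tilde{\alpha}_j| \le \sqrt{r(j, \varepsilon)} . \end{cases}$$

This program can be solved for every $u_j = \langle e_j, \alpha \rangle_{GN}$ individually: each of them
is a solution of

$$\begin{cases} \arg\min_{u} |u|^2 \\ \text{s.t. } |u - \tilde{\alpha}_j| \le \sqrt{r(j, \varepsilon)} . \end{cases}$$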



P. Alquier/ 'LASSO, IFS and Correlation Selector 



1142 



As a consequence,

$$u_j = \langle e_j, \hat{\alpha}^{CS} \rangle_{GN} = \mathrm{sgn}(\tilde{\alpha}_j) \bigl( |\tilde{\alpha}_j| - \sqrt{r(j, \varepsilon)} \bigr)_+$$

that does not depend on $p$. Note that $u_j$ is a thresholded estimator of the
correlation between $Y$ and $f_j(X)$; this is what suggested the name "Correlation
Selector". Let $U$ denote the column vector that contains the $u_j$ for $j \in \{1, \dots, m\}$
and $M$ the matrix $(\langle e_i, e_j \rangle_{GN})_{i,j}$; then $\hat{\alpha}^{CS}$ is just any solution of $\hat{\alpha}^{CS} M = U$.
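The two stages above (threshold the empirical correlations, then solve a linear system with the Gram matrix) can be sketched as follows in the deterministic design case. This is an illustration under stated assumptions: the dictionary is normalized, $M$ is invertible (the text only asks for "any solution", so a singular $M$ would need a different solver), and the names `F`, `r`, `solve_linear` and `correlation_selector` are hypothetical. Since $M$ is symmetric, solving $\hat{\alpha}^{CS} M = U$ amounts to solving $M \alpha = U$.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    aug = [[float(v) for v in A[i]] + [float(b[i])] for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda row: abs(aug[row][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular Gram matrix: no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[row][c] -= factor * aug[col][c]
    x = [0.0] * size
    for i in reversed(range(size)):
        rest = sum(aug[i][c] * x[c] for c in range(i + 1, size))
        x[i] = (aug[i][size] - rest) / aug[i][i]
    return x


def correlation_selector(F, Y, r):
    """Threshold the correlations tilde_alpha_j, then solve M alpha = U."""
    n, m = len(F), len(F[0])
    tilde = [sum(F[i][j] * Y[i] for i in range(n)) / n for j in range(m)]
    U = []
    for a, r_j in zip(tilde, r):
        excess = max(abs(a) - math.sqrt(r_j), 0.0)
        U.append(excess if a > 0 else -excess)
    # Gram matrix M[i][j] = <e_i, e_j>_n = (1/n) sum_k f_i(X_k) f_j(X_k)
    M = [[sum(F[k][i] * F[k][j] for k in range(n)) / n for j in range(m)]
         for i in range(m)]
    return solve_linear(M, U)
```

With `F = [[1, 1], [1, 1], [1, 1], [1, -1]]`, `Y = [4, 0, 0, 0]` and $r(j, \varepsilon) = 0.25$, the Gram matrix has off-diagonal entries $0.5$, $U = (0.5, 0.5)$ and the solution is $(1/3, 1/3)$.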

Remark 3.4. Note that the Correlation Selector has no reason to be sparse;
however, the vector $\hat{\alpha}^{CS} M$ is sparse. An interpretation of this fact is given in
the next subsection.



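3.6. Oracle inequality for the Correlation Selector

Theorem 3.4. We have:

$$P \Bigl[ \|\hat{\alpha}^{CS} - \overline{\alpha}\|_{CS}^2 \le \inf_{S \subset \{1, \dots, m\}} \Bigl( \sum_{j \notin S} \langle \overline{\alpha}, e_j \rangle_{GN}^2 + 4 \sum_{j \in S} r(j, \varepsilon) \Bigr) \Bigr] \ge 1 - \varepsilon .$$

Moreover, if we assume that there is a $D > 0$ such that for any $\alpha \in \mathcal{E}_m$,
$\|\alpha\|_{GN} \ge D \|\alpha\|$, where

$$\mathcal{E}_m = \bigl\{ \alpha \in \mathbb{R}^m, \ \langle \overline{\alpha}, e_j \rangle_{GN} = 0 \Rightarrow \langle \alpha, e_j \rangle = 0 \bigr\},$$

then we have

$$P \Bigl[ \|\hat{\alpha}^{CS} - \overline{\alpha}\|_{GN}^2 \le \frac{1}{D^2} \inf_{S \subset \{1, \dots, m\}} \Bigl( \sum_{j \notin S} \langle \overline{\alpha}, e_j \rangle_{GN}^2 + 4 \sum_{j \in S} r(j, \varepsilon) \Bigr) \Bigr] \ge 1 - \varepsilon .$$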



The proof can be found in Subsection 6.3 page 1150. 

Remark 3.5. Note that the result on $\|\hat{\alpha}^{CS} - \overline{\alpha}\|_{CS}$ does not require any
assumption on the dictionary of functions. However, this quantity does not have,
in general, an interesting interpretation. The result about the quantity of interest,
$\|\hat{\alpha}^{CS} - \overline{\alpha}\|_{GN}^2$, requires that a part of the dictionary is almost orthogonal;
this condition is to be compared to the one in Theorem 3.3.

Remark 3.6. Note that if there is an $S$ such that for any $j \notin S$, $\langle \overline{\alpha}, e_j \rangle_{GN} = 0$,
and if $r(j, \varepsilon) = k \log(m/\varepsilon)/n$, then we have:

$$P \Bigl[ \|\hat{\alpha}^{CS} - \overline{\alpha}\|_{CS}^2 \le \frac{4 k |S| \log \frac{m}{\varepsilon}}{n} \Bigr] \ge 1 - \varepsilon$$

and if moreover for any $\alpha \in \mathcal{E}_m$, $\|\alpha\|_{GN} \ge D \|\alpha\|$, then

$$P \Bigl[ \|\hat{\alpha}^{CS} - \overline{\alpha}\|_{GN}^2 \le \frac{4 k |S| \log \frac{m}{\varepsilon}}{D^2 n} \Bigr] \ge 1 - \varepsilon .$$

The condition that for a lot of $j$, $\langle \overline{\alpha}, e_j \rangle_{GN} = 0$, means that most of the functions
in the dictionary are not correlated with $Y$. In terms of sparsity, it means that
the vector $\overline{\alpha} M$ is sparse. So, intuitively, the Correlation Selector will perform
well when most of the functions in the dictionary have weak correlation with $Y$,
but we expect that altogether these functions can bring a reasonable prediction
for $Y$.

4. Numerical simulations

4.1. Motivation

We compare here LASSO, Iterative Feature Selection and the Correlation Selector
on a toy example, introduced by Tibshirani [19]. We also compare their performances
to the ordinary least squares (OLS) estimate as a benchmark. Note that
we will not propose a very fine choice for the $r(j, \varepsilon)$. The aim of these simulations
is not to identify a good choice for the penalization in practice, but
to observe the similarities and differences between different orders of projections
in our general algorithm, using the same confidence regions.

4.2. Description of the experiments

The model defined by Tibshirani [19] is the following. We have:

$$\forall i \in \{1, \dots, 20\}, \quad Y_i = \langle X_i, \beta \rangle + \varepsilon_i$$

with $X_i \in \mathcal{X} = \mathbb{R}^8$, $\beta \in \mathbb{R}^8$, and the $\varepsilon_i$ are i.i.d. from a Gaussian distribution
with mean $0$ and standard deviation $\sigma$.

The $X_i$ are i.i.d. too, and each $X_i$ comes from a Gaussian distribution with
mean $(0, \dots, 0)$ and with variance-covariance matrix:

$$\Sigma(\rho) = \bigl( \rho^{|i-j|} \bigr)_{i, j \in \{1, \dots, 8\}}$$

for $\rho \in [0, 1[$.

We will use the three particular values for $\beta$ taken by Tibshirani [19]:

$$\beta^1 = (3, 1.5, 0, 0, 2, 0, 0, 0),$$

$$\beta^2 = (1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5),$$

$$\beta^3 = (5, 0, 0, 0, 0, 0, 0, 0),$$

corresponding to a "sparse" situation ($\beta^1$), a "non-sparse" situation ($\beta^2$) and a
"very sparse" situation ($\beta^3$).

We use two values for $\sigma$: 1 (the "low noise case") and 3 (the "noisy case").

Finally, we use two values for $\rho$: 0.1 ("weakly correlated variables") and 0.5
("highly correlated variables").
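The data-generating model of this subsection can be sketched in a few lines. The paper's simulations were run in R; the Python code below is only an illustrative sketch, and the function name `simulate` and its arguments are assumptions. It uses the fact that a stationary autoregressive recursion $x_1 = z_1$, $x_{k} = \rho x_{k-1} + \sqrt{1 - \rho^2} \, z_k$ with independent standard Gaussian $z_k$ produces a Gaussian vector with covariance $\rho^{|i-j|}$.

```python
import math
import random


def simulate(beta, sigma, rho, n=20, seed=0):
    """Draw (X, Y) with Y_i = <X_i, beta> + eps_i, where X_i is Gaussian with
    covariance rho^|i-j| and eps_i is Gaussian with standard deviation sigma."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho ** 2)
    X, Y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0)]
        for _ in range(len(beta) - 1):
            # AR(1) step: keeps unit variance and correlation rho with the previous coordinate
            x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
        noise = rng.gauss(0.0, sigma)
        X.append(x)
        Y.append(sum(b * v for b, v in zip(beta, x)) + noise)
    return X, Y
```

For instance, `simulate([3, 1.5, 0, 0, 2, 0, 0, 0], sigma=3, rho=0.5)` draws one sample of size 20 for the "sparse", "noisy", "highly correlated" configuration.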

We run each example (corresponding to a given value of $\beta$, $\sigma$ and $\rho$) 250 times.
We use the software R [ i ] for simulations. We implement Iterative Feature
Selection as described in Subsection 3.2 page 1138, and the Correlation Selector,
while using the standard OLS estimate and the LASSO estimator given by the
LARS package described in [13]. Note that we use the estimators defined in the
deterministic design case; this means that we consider $\|\cdot\|_{GN} = \|\cdot\|_n$ (the
empirical norm) as our criterion here. The choice:



was not motivated by theoretical considerations but seems to perform well in 
practice. 

4.3. Results and comments

The results are reported in Table 1.

The following remarks can easily be made in view of the results:

• both methods based on projection on random confidence regions using
the norm $\|\cdot\|_{GN} = \|\cdot\|_n$ clearly outperform the OLS in the sparse cases;
moreover, they have the advantage of giving sparse estimates;

• in the non-sparse case, the OLS generally performs better than the other
methods, but LASSO is very close; it is known that a better choice for the
value $r(j, \varepsilon)$ would lead to a better result (see Tibshirani [19]);

• LASSO seems to be the best method on the whole set of experiments. It
is never the worst method, and always performs almost as
well as the best method;

• in the "sparse case" ($\beta^1$), note that IFS and LASSO are very close for the
small value of $\rho$. This is consistent with the previous theory, see Theorem
3.2 page 1139;

• IFS gives very bad results in the non-sparse case ($\beta^2$), but is the best
method in the very sparse case ($\beta^3$). This last point tends to indicate that
different situations should lead to a different choice for the confidence
regions we are to project on. However, theoretical results guiding that
choice are missing;

• the Correlation Selector performs badly on the whole set of experiments.
However, note that the good performances of LASSO and IFS occur for
sparse values of $\beta$, and the previous theory ensures good performances for
C-SEL when $\beta M$ is sparse, where $M$ is the covariance matrix of the $X_i$.
In other words, two experiments were favorable to LASSO and IFS, but
there was no experiment favorable to C-SEL.

In order to illustrate this last point, we build a new experiment favorable to
C-SEL. Note that we have

$$Y_i = \langle X_i, \beta \rangle + \varepsilon_i = \langle X_i M^{-1}, \beta M \rangle + \varepsilon_i \quad (4.1)$$

Table 1

Results of simulations. For each possible combination of $\beta$, $\sigma$ and $\rho$, we report
the mean empirical loss over the 250 simulations, the standard deviation of this quantity
over the simulations and finally the mean number of non-zero coefficients in the estimate,
for ordinary least squares (OLS), LASSO, Iterative Feature Selection (IFS) and Correlation
Selector (C-SEL). Each cell reads: mean loss / standard deviation / mean number of non-zero coefficients.

$\beta$                    $\sigma$  $\rho$   OLS               LASSO                IFS                  C-SEL
$\beta^1$ (sparse)         3    0.5   3.67 / 1.84 / 8   1.64 / 1.25 / 4.64   1.56 / 1.20 / 4.62   3.65 / 1.96 / 8
                           1    0.5   0.40 / 0.22 / 8   0.29 / 0.19 / 5.42   0.36 / 0.23 / 5.70   0.44 / 0.23 / 8
                           3    0.1   3.75 / 1.86 / 8   2.72 / 1.50 / 5.70   2.85 / 1.58 / 5.66   3.44 / 1.72 / 8
                           1    0.1   0.40 / 0.19 / 8   0.30 / 0.19 / 5.92   0.31 / 0.19 / 5.96   0.43 / 0.20 / 8
$\beta^2$ (non sparse)     3    0.5   3.54 / 1.82 / 8   3.36 / 1.64 / 7.08   4.90 / 1.58 / 6.57   3.98 / 1.85 / 8
                           1    0.5   0.41 / 0.21 / 8   0.54 / 0.93 / 7.94   0.84 / 0.36 / 7.89   0.47 / 0.24 / 8
                           3    0.1   3.78 / 1.78 / 8   3.82 / 1.51 / 7.06   4.50 / 1.59 / 7.03   4.01 / 1.86 / 8
                           1    0.1   0.40 / 0.20 / 8   0.42 / 0.29 / 7.98   0.71 / 0.32 / 7.98   0.48 / 0.22 / 8
$\beta^3$ (very sparse)    3    0.5   3.55 / 1.79 / 8   1.65 / 1.28 / 4.48   1.59 / 1.27 / 4.49   3.42 / 1.74 / 8
                           1    0.5   0.40 / 0.21 / 8   0.18 / 0.14 / 4.46   0.17 / 0.14 / 4.48   0.46 / 0.25 / 8
                           3    0.1   3.46 / 1.74 / 8   1.69 / 1.29 / 4.92   1.62 / 1.18 / 4.92   3.00 / 1.45 / 8
                           1    0.1   0.40 / 0.20 / 8   0.20 / 0.14 / 4.98   0.19 / 0.14 / 4.91   0.44 / 0.24 / 8


where $M$ is the correlation matrix of the $X_i$. Let us put $\tilde{X}_i = X_i M^{-1}$ and
$\tilde{\beta} = \beta M$; we have the following linear model:

$$Y_i = \langle \tilde{X}_i, \tilde{\beta} \rangle + \varepsilon_i . \quad (4.2)$$

The sparsity of $\beta$ gives an advantage to the LASSO for estimating $\beta$ in Model 4.1;
it also gives an advantage to C-SEL for estimating $\tilde{\beta}$ in Model 4.2 (according
to Remark 3.4 page 1142).

We run the experiments again with $\beta = \beta^3$ and this time we try to estimate
$\tilde{\beta}$ instead of $\beta$ (so we act as if we had observed $\tilde{X}_i$ and not $X_i$).

Table 2

Results for the estimation of $\tilde{\beta}$. As previously, for each possible combination of $\sigma$ and $\rho$, we
report the mean empirical loss over the 250 simulations, the standard deviation
of this quantity over the simulations and finally the mean number of non-zero coefficients in
the estimate, for each estimate: OLS, LASSO, IFS and C-SEL. Each cell reads: mean loss / standard deviation / mean number of non-zero coefficients.

            $\sigma$  $\rho$   OLS               LASSO                IFS                  C-SEL
(sparse)    3    0.5   3.64 / 1.99 / 8   4.83 / 2.53 / 5.98   5.12 / 2.64 / 6.05   2.41 / 1.92 / 8
            1    0.5   0.41 / 0.21 / 8   1.09 / 1.72 / 7.11   0.92 / 0.48 / 7.40   0.26 / 0.19 / 8
            3    0.1   3.65 / 1.71 / 8   3.71 / 1.96 / 6.25   3.72 / 1.99 / 6.28   2.09 / 1.40 / 8
            1    0.1   0.40 / 0.20 / 8   0.47 / 0.25 / 7.35   0.55 / 0.16 / 7.38   0.23 / 0.27 / 8




Results are given in Table 2. 

The Correlation Selector clearly outperforms the other methods in this case.

5. Conclusion

5.1. Comments on the results of the paper 

This paper provides a simple interpretation of well-known algorithms of statistical learning theory in terms of orthogonal projections on confidence regions. This very intuitive approach also provides tools to prove oracle inequalities.

Simulations show that methods based on confidence regions clearly outperform the OLS estimate in most examples. Actually, the theoretical results and the experiments lead to the following conclusion: when we think that $\bar\alpha$ is sparse, that is, when we assume that only a few functions in the dictionary are relevant, we should use the LASSO or the Dantzig Selector (we know from [5] that these estimators are almost equivalent); IFS can be seen as a good algorithmic approximation of the LASSO in the orthogonal case. In the other cases, we should think of another method of approximation (LARS, relaxed greedy algorithm, ...). When $\bar\alpha M$ is sparse, i.e. almost all the functions in the dictionary are uncorrelated with $Y$, the Correlation Selector seems to be a reasonable choice. This is, in some way, the "desperate case", where for example, for various reasons, a practitioner thinks that he has the right set of variables to explain $Y$, but he realizes that only a few of them are correlated with $Y$ and that methods based on the selection of a small subset of variables (LASSO, ...) lead to unsatisfactory results.

5.2. Extensions

First, note that all the results given here in the deterministic design case ($\|\cdot\|_{GN} = \|\cdot\|_n$) and in the random design case ($\|\cdot\|_{GN} = \|\cdot\|_X$) can be extended to another kind of regression problem: the transductive case, introduced by Vapnik [20]. In this case, we assume that $m$ more pairs $(X_{n+1}, Y_{n+1}), \ldots, (X_{n+m}, Y_{n+m})$ are drawn (i.i.d. from $P$), and that $X_{n+1}, \ldots, X_{n+m}$ are given to the statistician, whose task is now to predict the missing values $Y_{n+1}, \ldots, Y_{n+m}$. Here, we can introduce the following criterion:

\[ \|\alpha - \alpha'\|_{trans}^2 = \frac{1}{m} \sum_{i=n+1}^{n+m} \left[ \sum_{j=1}^{m} \alpha_j f_j(X_i) - \sum_{j=1}^{m} \alpha'_j f_j(X_i) \right]^2 . \]

In [2], we argue that this case is of considerable interest in practice, and we show that Assumption (CRA) can be satisfied in this context. So, the reader can check that all the results in the paper can be extended to the case $\|\cdot\|_{GN} = \|\cdot\|_{trans}$.
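
As an illustration, the transductive criterion can be evaluated directly from the dictionary and the additional design points. The sketch below is not taken from the paper: the function name `transductive_sq_distance`, the three-function dictionary and the points are hypothetical, and the squared distance is assumed to be normalized by the number of additional points.

```python
def transductive_sq_distance(alpha, alpha_prime, dictionary, new_points):
    """Average over the additional points of
    (sum_j alpha_j f_j(x) - sum_j alpha'_j f_j(x))^2.
    The normalization by the number of points is an assumption."""
    total = 0.0
    for x in new_points:
        diff = sum((a - b) * f(x) for a, b, f in zip(alpha, alpha_prime, dictionary))
        total += diff ** 2
    return total / len(new_points)


# Hypothetical dictionary (f_1, f_2, f_3) = (1, x, x^2).
dictionary = [lambda x: 1.0, lambda x: x, lambda x: x ** 2]
new_points = [0.0, 1.0, 2.0]          # stands for the additional design points

d = transductive_sq_distance([1.0, 2.0, 0.0], [1.0, 0.0, 1.0], dictionary, new_points)
```

Here the two coefficient vectors differ by the function 2x - x^2, which vanishes at 0 and 2 and equals 1 at 1, so `d` is 1/3. Only the unlabeled design points are needed to compute this distance.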

Also note that this approach can easily be extended to general statistical problems with quadratic loss: in our paper [1], the Iterative Feature Selection method is generalized to density estimation with quadratic loss, leading to the proposal of a LASSO-like program for density estimation, which has also been proposed and studied by Bunea, Tsybakov and Wegkamp [7] under the name SPADES.

5.3. Future works 

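Future works on this topic include a general study of the projection onto the intersection of the confidence regions,

\[ \begin{cases} \arg\min_{\alpha = (\alpha_1, \ldots, \alpha_m) \in \mathbb{R}^m} \delta(\alpha, 0) \\ \text{s.t. } \forall j \in \{1, \ldots, m\},\ \left| \langle \alpha, e_j \rangle_{GN} - \tilde\alpha_j \right| \le \sqrt{r(j, \varepsilon)} \end{cases} \]

for a generic distance $\delta(\cdot, \cdot)$.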

A generalization to confidence regions defined by grouped variables, which would include the Group LASSO studied by Bakin [3], Yuan and Lin [21] and Chesneau and Hebiri [11] as a particular case, is also feasible.

A more complete experimental study, including a comparison of various choices for $\delta(\cdot, \cdot)$ and for $r(j, \varepsilon)$, based on theoretical results or on heuristics, would be of great interest.

6. Proofs 

6.1. Proof of Proposition 3.1

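Proof. Let us recall Program 3.1:

\[ \begin{cases} \max_{\alpha \in \mathbb{R}^m} -\|\alpha\|_{GN}^2 \\ \text{s.t. } \forall j \in \{1, \ldots, m\},\ \left| \langle \alpha, e_j \rangle_{GN} - \tilde\alpha_j \right| \le \sqrt{r(j, \varepsilon)}. \end{cases} \]
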
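Let us write the Lagrangian of this program:

\[ \mathcal{L}(\alpha, \lambda, \mu) = - \sum_i \sum_j \alpha_i \alpha_j \langle e_i, e_j \rangle_{GN} + \sum_j \lambda_j \Big[ \sum_i \alpha_i \langle e_i, e_j \rangle_{GN} - \tilde\alpha_j + \sqrt{r(j, \varepsilon)} \Big] + \sum_j \mu_j \Big[ - \sum_i \alpha_i \langle e_i, e_j \rangle_{GN} + \tilde\alpha_j + \sqrt{r(j, \varepsilon)} \Big] \]

with, for any $j$, $\lambda_j \ge 0$, $\mu_j \ge 0$ and $\lambda_j \mu_j = 0$. Any solution $\alpha^*$ of Program 3.1 must satisfy, for any $j$,

\[ 0 = \frac{\partial \mathcal{L}}{\partial \alpha_j}(\alpha^*, \lambda, \mu) = -2 \sum_i \alpha^*_i \langle e_i, e_j \rangle_{GN} + \sum_i (\lambda_i - \mu_i) \langle e_i, e_j \rangle_{GN}, \]

so for any $j$,

\[ \Big\langle \sum_i \Big[ \frac{\lambda_i - \mu_i}{2} - \alpha^*_i \Big] e_i, e_j \Big\rangle_{GN} = 0. \]

Note that this also implies that:

\[ 0 = \sum_j \Big[ \frac{\lambda_j - \mu_j}{2} - \alpha^*_j \Big] \Big\langle \sum_i \Big[ \frac{\lambda_i - \mu_i}{2} - \alpha^*_i \Big] e_i, e_j \Big\rangle_{GN} = \Big\| \sum_j \frac{\lambda_j - \mu_j}{2} e_j - \sum_j \alpha^*_j e_j \Big\|_{GN}^2. \tag{6.2} \]
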

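Using these relations, the Lagrangian may be written:

\[ \mathcal{L}(\alpha^*, \lambda, \mu) = - \sum_i \sum_j \frac{1}{2}(\lambda_i - \mu_i) \frac{1}{2}(\lambda_j - \mu_j) \langle e_i, e_j \rangle_{GN} + \sum_i \sum_j \frac{1}{2}(\lambda_i - \mu_i)(\lambda_j - \mu_j) \langle e_i, e_j \rangle_{GN} - \sum_j (\lambda_j - \mu_j) \tilde\alpha_j + \sum_j (\lambda_j + \mu_j) \sqrt{r(j, \varepsilon)}. \]

Note that the condition $\lambda_j \ge 0$, $\mu_j \ge 0$ and $\lambda_j \mu_j = 0$ means that there is a $\gamma_j \in \mathbb{R}$ such that $\gamma_j = \frac{1}{2}(\lambda_j - \mu_j)$, $|\gamma_j| = \frac{1}{2}(\lambda_j + \mu_j)$, and so $\mu_j = (2\gamma_j)_-$ and $\lambda_j = (2\gamma_j)_+$. Let also $\gamma$ denote the vector whose $j$-th component is exactly $\gamma_j$; we obtain:

\[ \mathcal{L}(\alpha^*, \lambda, \mu) = \|\gamma\|_{GN}^2 - 2 \sum_j \gamma_j \tilde\alpha_j + 2 \sum_j |\gamma_j| \sqrt{r(j, \varepsilon)}, \]

which has to be minimized with respect to the $\lambda_j$ and $\mu_j$, so with respect to $\gamma$. So $\gamma$ is a solution of Program 3.2.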

Now, note that Equation 6.2 ensures that any solution $\alpha^*$ of Program 3.1 satisfies:

\[ \|\alpha^* - \gamma\|_{GN} = 0. \]

We can easily see that $\alpha^* = \gamma$ is a possible solution.
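
The correspondence between the constrained program and the penalized criterion $\|\gamma\|_{GN}^2 - 2\sum_j \gamma_j \tilde\alpha_j + 2\sum_j |\gamma_j| \sqrt{r(j,\varepsilon)}$ can be illustrated numerically in the simplest setting. The sketch below assumes a single coordinate and the Euclidean norm (an assumption made only for this illustration); the function names and the numerical values are hypothetical. It compares a brute-force minimizer of the penalized criterion with the point of smallest absolute value satisfying the constraint.

```python
import math


def penalized_objective(gamma, alpha_tilde, r):
    """One-dimensional penalized criterion:
    gamma^2 - 2*gamma*alpha_tilde + 2*|gamma|*sqrt(r)."""
    return gamma ** 2 - 2.0 * gamma * alpha_tilde + 2.0 * abs(gamma) * math.sqrt(r)


def min_norm_in_interval(alpha_tilde, r):
    """Point of smallest absolute value in the interval
    [alpha_tilde - sqrt(r), alpha_tilde + sqrt(r)]."""
    lo, hi = alpha_tilde - math.sqrt(r), alpha_tilde + math.sqrt(r)
    if lo <= 0.0 <= hi:
        return 0.0
    return lo if lo > 0.0 else hi


def grid_argmin(alpha_tilde, r, half_width=5.0, steps=20001):
    """Brute-force minimizer of the penalized criterion on a regular grid."""
    grid = [-half_width + 2.0 * half_width * k / (steps - 1) for k in range(steps)]
    return min(grid, key=lambda g: penalized_objective(g, alpha_tilde, r))


# Arbitrary (alpha_tilde, r) pairs chosen for illustration only.
cases = [(2.0, 0.25), (-1.5, 1.0), (0.3, 1.0)]
results = [(grid_argmin(a, r), min_norm_in_interval(a, r)) for a, r in cases]
```

In each of these cases the two values agree up to the grid resolution, which is what the proposition asserts in this special case.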

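In the case where $\|\cdot\|_{GN}$ is the empirical norm $\|\cdot\|_n$ we obtain:

\[ \|\gamma\|_{GN}^2 - 2 \sum_{j=1}^m \gamma_j \tilde\alpha_j = \frac{1}{n} \sum_{i=1}^n \Big[ \sum_{j=1}^m \gamma_j f_j(X_i) \Big]^2 - \frac{2}{n} \sum_{i=1}^n Y_i \sum_{j=1}^m \gamma_j f_j(X_i) = \frac{1}{n} \sum_{i=1}^n \Big[ Y_i - \sum_{j=1}^m \gamma_j f_j(X_i) \Big]^2 - \frac{1}{n} \sum_{i=1}^n Y_i^2. \]

□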



6.2. Proof of Theorem 3.2 

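Proof. In the case of orthogonality, we have $\|\cdot\|_{GN} = \|\cdot\|$, the Euclidean norm. So $\hat\alpha^{L}$ satisfies, according to its definition:

\[ \begin{cases} \arg\min_{\alpha = (\alpha_1, \ldots, \alpha_m) \in \mathbb{R}^m} \sum_{j=1}^m \alpha_j^2 \\ \text{s.t. } \forall j \in \{1, \ldots, m\},\ |\alpha_j - \tilde\alpha_j| \le \sqrt{r(j, \varepsilon)} \end{cases} \]

while $\hat\alpha^{DS}$ satisfies:

\[ \begin{cases} \arg\min_{\alpha = (\alpha_1, \ldots, \alpha_m) \in \mathbb{R}^m} \sum_{j=1}^m |\alpha_j| \\ \text{s.t. } \forall j \in \{1, \ldots, m\},\ |\alpha_j - \tilde\alpha_j| \le \sqrt{r(j, \varepsilon)}. \end{cases} \]

We can easily solve both problems by an individual optimization on each $\alpha_j$ and obtain the same solution

\[ \alpha^*_j = \mathrm{sgn}(\tilde\alpha_j) \left( |\tilde\alpha_j| - \sqrt{r(j, \varepsilon)} \right)_+ . \]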
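
The common solution is the usual soft-thresholding rule, $\mathrm{sgn}(\tilde\alpha_j)(|\tilde\alpha_j| - \sqrt{r(j,\varepsilon)})_+$ applied coordinate by coordinate. The following sketch, with hypothetical values for the estimates and for the radii $r(j,\varepsilon)$, computes it and checks on randomly drawn feasible points that none of them has a smaller sum of squares or a smaller sum of absolute values. This is a numerical sanity check, not a proof.

```python
import math
import random


def soft_threshold(alpha_tilde, radii):
    """Coordinate-wise rule sgn(a) * max(|a| - sqrt(r), 0)."""
    out = []
    for a, r in zip(alpha_tilde, radii):
        shrunk = max(abs(a) - math.sqrt(r), 0.0)
        out.append(math.copysign(shrunk, a))
    return out


def sum_of_squares(v):
    return sum(x * x for x in v)


def sum_of_abs(v):
    return sum(abs(x) for x in v)


alpha_tilde = [2.0, -0.4, 0.9, -3.1]   # arbitrary illustrative estimates
radii = [0.25, 0.25, 1.0, 0.5]         # arbitrary illustrative values of r(j, eps)
alpha_star = soft_threshold(alpha_tilde, radii)

random.seed(0)


def random_feasible_point():
    """Draw a point with |alpha_j - alpha_tilde_j| <= sqrt(r_j) for every j."""
    return [a + random.uniform(-1.0, 1.0) * math.sqrt(r)
            for a, r in zip(alpha_tilde, radii)]


samples = [random_feasible_point() for _ in range(1000)]
never_beaten = all(
    sum_of_squares(alpha_star) <= sum_of_squares(s) + 1e-12
    and sum_of_abs(alpha_star) <= sum_of_abs(s) + 1e-12
    for s in samples
)
```

Because both objectives are sums over coordinates and the constraints act on each coordinate separately, minimizing the absolute value of each coordinate within its interval minimizes both criteria at once, which is why the two programs share the same solution here.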



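For $\hat\alpha^{IFS}$, just note that in the case of orthogonality, sequential projections on each $\mathcal{CR}(j, \varepsilon)$ lead to the same result as the projection on their intersection, so $\hat\alpha^{IFS} = \hat\alpha^{L}$. Then, let us choose $S \subset \{1, \ldots, m\}$ and remark that

\[ \|\hat\alpha^{L} - \bar\alpha\|_{GN}^2 = \sum_{j \in S} \langle \hat\alpha^{L} - \bar\alpha, e_j \rangle^2 + \sum_{j \notin S} \langle \hat\alpha^{L} - \bar\alpha, e_j \rangle^2. \]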
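Now, with Assumption (CRA), with probability $1 - \varepsilon$, for any $j$, $\bar\alpha$ satisfies the same constraint as the LASSO estimator, so

\[ \left| \langle \bar\alpha, e_j \rangle - \tilde\alpha_j \right| \le \sqrt{r(j, \varepsilon)} \]

and so

\[ \left| \langle \hat\alpha^{L} - \bar\alpha, e_j \rangle \right| = \left| \alpha^*_j - \langle \bar\alpha, e_j \rangle \right| \le \left| \alpha^*_j - \tilde\alpha_j \right| + \left| \langle \bar\alpha, e_j \rangle - \tilde\alpha_j \right| \le 2 \sqrt{r(j, \varepsilon)}. \]

Moreover, let us remark that $\alpha^*_j$ is the number with the smallest absolute value satisfying this constraint, so

\[ \left| \alpha^*_j - \langle \bar\alpha, e_j \rangle \right| \le \max\left( |\alpha^*_j|, |\langle \bar\alpha, e_j \rangle| \right) \le |\langle \bar\alpha, e_j \rangle|. \]

So we can conclude

\[ \|\hat\alpha^{L} - \bar\alpha\|_{GN}^2 \le 4 \sum_{j \in S} r(j, \varepsilon) + \sum_{j \notin S} \langle \bar\alpha, e_j \rangle^2. \]

□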



6.3. Proof of Theorem 3.4

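Proof. Note that, for any $S$:

\[ \|\hat\alpha^{CS} - \bar\alpha\|_{CS}^2 = \sum_{j=1}^m \langle \hat\alpha^{CS} - \bar\alpha, e_j \rangle_{GN}^2 = \sum_{j \in S} \langle \hat\alpha^{CS} - \bar\alpha, e_j \rangle_{GN}^2 + \sum_{j \notin S} \langle \hat\alpha^{CS} - \bar\alpha, e_j \rangle_{GN}^2. \]

By the constraint satisfied by $\hat\alpha^{CS}$ we have:

\[ \langle \hat\alpha^{CS} - \bar\alpha, e_j \rangle_{GN}^2 \le 4 r(j, \varepsilon). \]

Moreover, we must remember that $u_j = \langle \hat\alpha^{CS}, e_j \rangle_{GN}$ satisfies the program

\[ \begin{cases} \arg\min_u |u| \\ \text{s.t. } |u - \tilde\alpha_j| \le \sqrt{r(j, \varepsilon)}, \end{cases} \]

whose constraint is also satisfied by $\langle \bar\alpha, e_j \rangle_{GN}$, so $|u_j| \le |\langle \bar\alpha, e_j \rangle_{GN}|$ and so

\[ \left| u_j - \langle \bar\alpha, e_j \rangle_{GN} \right| \le \max\left( |u_j|, |\langle \bar\alpha, e_j \rangle_{GN}| \right) = |\langle \bar\alpha, e_j \rangle_{GN}| \]

and so we have the relation:

\[ \langle \hat\alpha^{CS} - \bar\alpha, e_j \rangle_{GN}^2 \le \langle \bar\alpha, e_j \rangle_{GN}^2. \]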
So we obtain:

\[ \|\hat\alpha^{CS} - \bar\alpha\|_{CS}^2 \le \sum_{j \in S} 4 r(j, \varepsilon) + \sum_{j \notin S} \langle \bar\alpha, e_j \rangle_{GN}^2. \]

This proves the first inequality of the theorem. For the second one, we just have 
to prove that (a cs — a)M G £ m . But this is trivial because of the relation: 

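\[ \langle (\hat\alpha^{CS} - \bar\alpha) M, e_j \rangle^2 = \langle \hat\alpha^{CS} - \bar\alpha, e_j \rangle_{GN}^2 \le \langle \bar\alpha, e_j \rangle_{GN}^2. \]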

□ 